Binary Data Formats

Let's learn about the common binary encoding styles used in modern API architectures.

Introduction#

Text-based data representation is often preferred when interacting with different clients, especially when client implementations are outside of an organization’s control, because human-readable data is easier to analyze and debug. For internal use, such as in iOS and Android applications (where the organization tightly controls both ends of the communication), there’s more freedom and flexibility in choosing the data format, and we can move to formats that are more compact, faster to parse, and lighter on the database. Binary encoding formats share the following benefits:

  • Machine-friendly: The data is in binary format and can be processed with little or no preprocessing.

  • Schema dependent: Data-related information is defined in the schema document.

  • Portability: Binary formats can be easily deserialized in different languages as long as both sides share the same encoding and decoding schema.

  • Precision support: Binary formats can store large numbers (as variable-length integers) and floating-point numbers with greater precision.

  • Standardized: The binary format uses a well-defined schema that's well-documented and can be standardized across different implementations.

A schema document describes how to structure data in a specific format. For example, the following schema, defined in JSON, represents the structure of a post object to be encoded in binary format.

{
    "type": "record",
    "name": "Post",
    "fields": [
        {"name": "postURL", "type": "string"},
        {"name": "postID", "type": "long"},
        {"name": "postTags", "type": {"type": "array", "items": "string"}}
    ]
}

Here, we define that the data is divided into records, where each record is an object of type Post with the following attributes:

  • The variable postURL is of type string.
  • The variable postID is of type long.
  • The variable postTags is an array of elements of the type string.
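To make this concrete, here is a minimal, hypothetical sketch of how a schema-driven encoder might serialize such a Post record in a value-only binary layout (length-prefixed strings and variable-length integers). Real formats such as Avro differ in details (for example, zigzag encoding for signed numbers), so treat this only as an illustration of the idea:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time (high bit = more bytes follow)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Length-prefixed UTF-8 string: varint length, then the raw bytes."""
    data = s.encode("utf-8")
    return encode_varint(len(data)) + data

def encode_post(post_url: str, post_id: int, post_tags: list) -> bytes:
    """Serialize a Post record in schema order: values only, no field names."""
    body = encode_string(post_url)        # postURL: string
    body += encode_varint(post_id)        # postID: long
    body += encode_varint(len(post_tags)) # postTags: array length first
    for tag in post_tags:
        body += encode_string(tag)        # each tag: string
    return body

encoded = encode_post("https://example.com/p/1", 42, ["api", "binary"])
```

Because the field names live in the schema rather than in the data, the 37-byte output above carries nothing but values; a decoder needs the same schema to know which value is which.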

Many binary formats exist, such as MessagePack, Fast Infoset, BSON, WBXML, Protobuf, Thrift, Avro, and so on. Let's see how the post object above looks when encoded in MessagePack, a binary variant of JSON.

Textual representation of MessagePack encoding

The binary version of the above message object is demonstrated in the image below:

The binary representation of MessagePack encoding

MessagePack and other binary variants of JSON and XML follow a schemaless encoding style, which embeds field names in the data, increases the size of the encoded output, and makes these formats less common in API development. For example, in the above illustration, disablelinks takes an extra 12 bytes to encode the field name, whereas its value is only one byte, representing true.
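This overhead is easy to reproduce by hand-encoding a one-field map per the MessagePack specification (fixmap and fixstr headers, single-byte booleans). The sketch below is illustrative and not a substitute for a real MessagePack library:

```python
def msgpack_bool_map(fields: dict) -> bytes:
    """Hand-encode a small {short string: bool} map per the MessagePack spec."""
    assert len(fields) < 16
    out = bytearray([0x80 | len(fields)])   # fixmap header: 0x80 | size
    for name, value in fields.items():
        data = name.encode("utf-8")
        assert len(data) < 32
        out.append(0xA0 | len(data))        # fixstr header: 0xa0 | length
        out += data                         # the field name itself
        out.append(0xC3 if value else 0xC2) # true = 0xc3, false = 0xc2
    return bytes(out)

encoded = msgpack_bool_map({"disablelinks": True})
# 1 (map header) + 1 (string header) + 12 (field name) + 1 (bool) = 15 bytes
```

Fourteen of the fifteen bytes are structure and field name; only one byte carries the actual value.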

Thrift and Protobuf#

The most commonly used binary data formats are Protocol Buffers (Protobuf), developed by Google, and Apache Thrift, developed by Facebook. Both binary encoding libraries follow similar rules and have been open-sourced for standardized public use. They have similar-looking schemas, shown below:

Apache Thrift
struct Node {
    1: required i64 nodeID,
    2: optional string nodeContent,
    3: optional list<string> nodeLinks
}
Protocol buffers
message Node {
    required int64 node_id        = 1;
    optional string node_content  = 2;
    repeated string node_links    = 3;
}

We can break down these schema definitions as follows:

  • Tags: Each field is identified by a number rather than by its name. For example, the tag 1 identifies the field nodeID in Thrift and node_id in Protobuf. These tags, not the field names, are encoded in the data, allowing for a more compact representation.

  • Labels: The markers required, optional, and repeated (Protobuf only) imply checks to determine whether to raise an error when encoding or decoding data. These markers are only added to the schema but are not encoded in the transmitted data.

  • Types: Each field can define its data type (string, i64, int64, etc.) by specifying a type marker. Protobuf does not have a type to represent arrays, and it uses the repeated label to show that the data can have multiple occurrences.

  • Names: The markers (nodeID, node_id, etc.) represent the names of the data entity. One thing worth noting is that these are not identifiers for the fields, which allows a name to be changed without affecting the implementation.

Note: The repeated label in Protobuf allows the flexibility to change a single variable into an array of variables across schema versions, but only for optional fields; otherwise, it can cause compatibility problems. To read more, see the Thrift documentation and the Protobuf documentation.
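As a sketch of how tags reach the wire in Protobuf, each encoded field starts with a key formed from the tag number and a wire type, itself varint-encoded; the field name never appears in the output:

```python
def field_key(field_number: int, wire_type: int) -> bytes:
    """Protobuf field key: (field_number << 3) | wire_type, varint-encoded."""
    key = (field_number << 3) | wire_type
    out = bytearray()
    while True:
        byte = key & 0x7F
        key >>= 7
        if key:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Two of Protobuf's wire types: varint (used for int64) and
# length-delimited (used for strings and repeated packed data).
VARINT, LEN = 0, 2

# node_id has tag 1 and is an int64 -> single key byte 0x08
assert field_key(1, VARINT) == b"\x08"
# node_content has tag 2 and is a string -> single key byte 0x12
assert field_key(2, LEN) == b"\x12"
```

This is why tags below 16 are prized in Protobuf schemas: the key fits in one byte regardless of how long the field name is.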

Apache Avro#

Apache Avro is an open-source binary encoding format developed as part of the Apache Hadoop project. Avro supports two schema styles: Avro IDL and JSON-based definitions. JSON is well-known in the industry and allows developers to create schemas without having to learn another language. Avro doesn’t encode tags and types like Thrift and Protobuf do; instead, it uses a value-only encoding style and relies on the schema to identify fields when decoding. If there are any conflicts due to different versions of the writer's and reader's schemas, the Avro library resolves them through a side-by-side comparison of the two versions, as shown below.

Avro schema resolution of a post record

Here, the field name is used as the field's identifier, so the records can be in any order as long as the field names remain the same. If the reader code finds a field that is not in the reader schema, it simply ignores that field. On the other hand, if the record doesn't contain a field expected by the reader schema, the reader code populates it with the default value specified in the reader schema.
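These resolution rules can be sketched in a few lines of Python. This is an illustrative simplification, not the actual Avro library logic; resolve_record and its schema layout are hypothetical:

```python
def resolve_record(writer_record: dict, reader_schema: list) -> dict:
    """Avro-style resolution sketch: match fields by name, ignore unknown
    writer fields, and fill missing reader fields from their defaults."""
    resolved = {}
    for field in reader_schema:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]   # names match -> take the value
        elif "default" in field:
            resolved[name] = field["default"]      # missing -> use reader default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved

reader_schema = [
    {"name": "postID", "type": "long"},
    {"name": "postTags", "type": {"type": "array", "items": "string"},
     "default": []},
]

# The writer wrote an extra field (postURL) and omitted postTags.
record = resolve_record({"postID": 7, "postURL": "/p/7"}, reader_schema)
```

The unknown postURL field is silently dropped, and the missing postTags field is filled with the reader's default, mirroring the behavior described above.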

To read more on Apache Avro see the Apache Avro Documentation.

Point to Ponder

Question

How does the reader know which version of the schema was used to encode the data?

Answer

Avro proposes several solutions to specify the writer’s schema in different scenarios. Some of them are as follows:

  • To store large files, it attaches the writer’s schema once at the beginning of the file.
  • To store individual records, it maintains different versions of the writer's schema in a database and includes the version number in each individual record.
  • To send data on the internet, it shares the writer’s schema when setting up a stateful connection.

Avro is very flexible with dynamically generated schemas. It handles schema transitions efficiently when decoding, and administrators don’t have to worry about manually updating field names, tags, types, and so on, to keep them compatible with new versions. Thrift and Protobuf, on the other hand, generate code and perform compile-time checks (type safety, and so forth) for statically typed languages such as C, C++, Java, C#, and more. Avro also supports code generation as an optimization for statically typed languages. However, code generation is not required, especially when dealing with dynamically typed languages.

Advantages and Limitations#

Binary formats are much more compact and faster in terms of processing than text-based formats. They’re flexible with clearly defined version compatibility and dynamically-generated code support. However, when it comes to human interaction, they require preprocessing to make them human-readable. Let’s see how these binary formats map to our general criteria of an optimal data format, outlined in the table below:

Feature Support Comparison

| Feature | Thrift | Protobuf | Avro |
| --- | --- | --- | --- |
| Schema document | Static | Static | Static and dynamic |
| Human-readable | No | No | No |
| Low latency | Fast data transfer due to compact size | Faster data transfer due to more compact size | Fastest data transfer due to most compact size |
| Standardized | Yes | Yes | Yes |
| Machine friendly | Yes | Yes | Yes |
| Interoperable | Yes | Yes | Yes |
| Flexible | Full compatibility | Full compatibility | Full compatibility |

Recommendations#

Choosing a data exchange format for an API is always debatable, because each format is suitable for some use cases and may have drawbacks for others. However, this section provides general recommendations for using the textual and binary data formats discussed in this chapter.

  • JSON is probably the ideal choice when dealing with small groups of systems, especially those developed in JavaScript, where human-readability is essential.

  • XML is probably the ideal format of choice when dealing with various systems with complex data structures that require markup and human-readability.

  • Avro is likely the ideal choice when dealing with large files, frequent interactions of data encoded using different schema versions, and significant encoding sizes.

  • Protobuf or Thrift may serve the purpose when network latency, interprocess communication, and processing speed are paramount.

The choice is not limited to the data formats discussed above; other formats, such as YAML, FormData, SQL, and so on, can also be considered. We can also introduce a custom format when an existing format doesn't meet our needs, keeping in mind the cost and effort required for internal usage (within the system) or external usage (sending information to end users).

Quiz

Question

Why is binary data smaller in size and higher performing compared to textual data?

Answer

Some binary representations use compression algorithms like LZ77, RLE, etc., to reduce size. However, the real gain comes not only from compression but also from reducing markup (the extra information required for data serialization and deserialization). Most binary formats define schemas for serializing and deserializing data, which are either transmitted once or saved in a database, reducing the size of each request. The smaller size makes binary data well-suited for storage and for sending over the wire. Moreover, binary data doesn’t need to be converted to another format for processing (since the CPU also performs computations in binary). Therefore, binary data is the most compact and highest-performing format.
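A quick way to see the markup savings is to encode the same three numbers once as JSON (field names plus textual digits) and once as fixed-width binary values whose names and types are known from a schema. This is a sketch using only Python's standard library:

```python
import json
import struct

a, b, c = 123456789, 3.14159, 2718281828

# Text: field names and digits travel with every message.
as_json = json.dumps({"a": a, "b": b, "c": c}).encode("utf-8")

# Schema-driven binary: two 64-bit ints and one 64-bit float,
# little-endian; the schema (not the payload) says which is which.
as_binary = struct.pack("<qdq", a, b, c)

print(len(as_json), len(as_binary))
```

The binary payload is a fixed 24 bytes no matter how long the numbers' decimal representations are, while the JSON payload grows with both the digits and the field names.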
